Lesson 4


setwd("~/MOOCs/Udacity/R Data Science")
pf <- read.csv("pseudo_facebook.tsv", sep='\t')
require(ggplot2)
## Loading required package: ggplot2

Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()


What are some things that you notice right away?

Response: Friend counts are highest among users under the age of thirty. There is also a surprisingly large number of users over the age of 90 with high friend counts, probably more than exist in reality. One particular age between 60 and 90 also shows an unusually tall column of friend counts.


ggplot Syntax

Notes:

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00
ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_point() +
        xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).


Overplotting

Notes: Overplotting makes it difficult to tell how many points are in each region. Setting alpha = 1/20 in our geom_point layer means that it takes 20 overlapping points to be the equivalent of one of the black dots in the previous plot.

ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_point(alpha=1/20) +
        xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

Notes: We can also add jitter to our plot. Age is recorded in whole years, so the points line up on top of each other in perfect vertical columns, which is not a true reflection of age (age is really continuous). Jitter adds a small amount of random noise to each point, which gives a clearer picture of age versus friend count.

ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_jitter(alpha=1/20) +
        xlim(13,90)
## Warning: Removed 5183 rows containing missing values (geom_point).
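
To see what jitter does numerically, here is a small base R sketch on made-up integer ages (not the course data). It assumes uniform noise of plus or minus 0.4, which matches my understanding of ggplot2's default jitter width (40% of the data resolution); treat that figure as an assumption. It also suggests one likely reason the warning count rises from 4906 to 5183 once jitter is added: some points at the edge ages are pushed outside xlim(13, 90).

```r
# Toy data (assumed for illustration): integer ages, 250 users per age.
set.seed(42)
age <- rep(c(13, 14, 50, 90), each = 250)

# Every user of the same age sits on exactly the same x position.
length(unique(age))            # 4 distinct x positions

# Add uniform horizontal noise of +/- 0.4 (assumed jitter width).
age_jittered <- age + runif(length(age), min = -0.4, max = 0.4)

# Points jittered past the axis limits would be dropped by xlim(13, 90).
outside <- age_jittered < 13 | age_jittered > 90
sum(outside)                   # only users aged 13 or 90 can end up here
```
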

What do you notice in the plot?

Response: I notice that there is an odd spike out near age 70. I also see that the majority of people with high friend counts are under age 30. But now the friend counts are not nearly as high as they were in the previous plot.


Coord_trans()

Notes:

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_jitter(alpha=1/20) +
        xlim(13,90) +
        coord_trans(x = "sqrt")
## Warning: Removed 5183 rows containing missing values (geom_point).

ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_point(alpha=1/20) +
        xlim(13,90) +
        coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).

What do you notice?

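Notes: coord_trans changes the shape of the plot. With y = "sqrt", the y axis is stretched at low friend counts and compressed at high ones, so the dense region near zero is easier to read.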
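
A quick numeric check of what a square-root axis does, using arbitrary friend counts picked so that the roots are round numbers (they are not taken from the data):

```r
# Arbitrary friend counts, chosen for illustration only.
friend_count <- c(0, 100, 400, 900, 1600)

# Positions on a sqrt-transformed axis.
sqrt(friend_count)          # 0 10 20 30 40

# Gaps between neighbours grow on the original scale...
diff(friend_count)          # 100 300 500 700
# ...but are equal after the transformation, so large counts are compressed.
diff(sqrt(friend_count))    # 10 10 10 10
```
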


Alpha and Jitter

Notes: Here we add jitter through geom_point(position = position_jitter(h = 0)) instead of using geom_jitter, because geom_jitter cannot be layered with a sqrt coordinate transformation of the y variable. (We can transform x, but not y.) Setting h = 0 turns off the vertical jitter, so the noise is added to age only. We need this because if we took a friendships_initiated count of zero and added noise that made it negative, the square root would no longer be a real number (R returns NaN), which would produce an error.

ggplot(aes(x = age, y = friendships_initiated), data = pf) +
        geom_point(alpha=1/10, position = position_jitter(h=0)) +
        xlim(13,90) +
        coord_trans(y = "sqrt")
## Warning: Removed 5181 rows containing missing values (geom_point).
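
The problem with vertical jitter can be reproduced without any plotting. In this sketch the noise values are hypothetical stand-ins for whatever jitter would add:

```r
# A friendships_initiated value of 0, plus hypothetical vertical jitter.
y <- 0
noise <- c(-0.3, 0.2)          # assumed jitter amounts, for illustration
y_jittered <- y + noise

# sqrt() of a negative number is NaN in R (with a warning), not a complex value.
result <- suppressWarnings(sqrt(y_jittered))
is.nan(result)                 # TRUE FALSE

# With zero vertical jitter the y value is untouched, so sqrt is always defined.
sqrt(y + 0)                    # 0
```
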


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)

head(pf.fc_by_age)
## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196
pf.fc_by_age <- pf %>%
        group_by(age)%>%
        summarize(friend_count_mean = mean(friend_count),
                  friend_count_median = median(friend_count),
                  n = n()) %>%
        arrange(age)

head(pf.fc_by_age, 20)
## Source: local data frame [20 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032
## 11    23          202.8426                93.0  4404
## 12    24          185.7121                92.0  2827
## 13    25          131.0211                62.0  3641
## 14    26          144.0082                75.0  2815
## 15    27          134.1473                72.0  2240
## 16    28          125.8354                66.0  2364
## 17    29          120.8182                66.0  1936
## 18    30          115.2080                67.5  1716
## 19    31          118.4599                63.0  1694
## 20    32          114.2800                63.0  1443

Create your plot!

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
        geom_line()
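
For comparison, the same kind of conditional summary can be sketched in base R with tapply. The tiny data frame is made up; only the column names age and friend_count follow the lesson.

```r
# Made-up data: two ages, a few users each.
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14),
  friend_count = c(10, 20, 60, 100, 300)
)

# Mean, median and group size of friend_count for each age.
# tapply returns its results ordered by sorted age, matching sort(unique(age)).
by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)
by_age
```

Age 13 gets a mean of 30 and a median of 20 from three users; age 14 gets a mean and median of 200 from two users.
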


Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
        geom_point(alpha=1/10,
                   position = position_jitter(h=0),
                   color = "orange") +
        geom_line(stat = "summary", fun.y = mean) +
        geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1), linetype = 2, color = "blue") +
        geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9), linetype = 2, color = "blue") +
        geom_line(stat = "summary", fun.y = median, color = "blue") +
        coord_cartesian(xlim = c(13,70), ylim = c(0,1000))

What are some of your observations of the plot?

Response: We can see that the mean is consistently higher than the median, because the data are skewed by high outliers. The 10% and 90% quantile lines also give a better sense of the probable range for each age group, and of how “tight” the distribution happens to be.
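
A small made-up example shows why the mean sits above the median in right-skewed data, and what quantile() returns for the dashed lines. The five values stand in for the friend counts of one age group.

```r
# Made-up, right-skewed friend counts for a single age group.
friend_count <- c(1, 2, 3, 4, 100)

mean(friend_count)                    # 22
median(friend_count)                  # 3

# The 10th and 90th percentiles (R's default quantile type).
quantile(friend_count, probs = 0.1)   # 1.4
quantile(friend_count, probs = 0.9)   # 61.6
```
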


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

cor.test(pf$age, pf$friend_count, method = "pearson")$estimate
##         cor 
## -0.02740737
#with(pf, cor.test(age, friend_count, method = "pearson"))$estimate

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027


Correlation on Subsets

Notes:

with(subset(pf, age <= 70), cor.test(age, friend_count))$estimate
##        cor 
## -0.1717245

Correlation Methods

Notes: The summary statistic tells a story of a negative relationship between age and friend count. But it does not imply causation. To establish causation, we would want to run an experiment and use inferential statistics rather than descriptive statistics.
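
The method argument of cor.test accepts "pearson", "kendall", or "spearman". As a rough illustration of how the choice matters, this sketch uses made-up data that is perfectly monotonic but not linear:

```r
# Made-up data with a strictly increasing, non-linear relationship.
x <- 1:10
y <- x^3

# Pearson measures linear association, so it falls short of 1 here.
pearson <- cor.test(x, y, method = "pearson")$estimate

# Spearman works on ranks, so any strictly increasing relationship gives 1.
spearman <- cor.test(x, y, method = "spearman")$estimate

pearson
spearman
```
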


Create Scatterplots

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
        geom_point(alpha = 1/50,
                   position = position_jitter(h=0),
                   color = "green") +
        geom_line(stat = "summary", fun.y = mean) +
        geom_line(stat = "summary", fun.y = median, color = "red") +
        geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1), color = "blue", linetype = 2) +
        geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9), color = "blue", linetype = 2) +
        coord_cartesian(xlim = c(0,200), ylim = c(0,200))


Strong Correlations

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
        geom_point() +
        xlim(0, quantile(pf$www_likes_received, 0.95)) +
        ylim(0, quantile(pf$likes_received, 0.95)) +
        geom_smooth(method = "lm", color = "red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(pf$www_likes_received, pf$likes_received)$estimate
##       cor 
## 0.9479902

Response: 0.948. Strong correlations like that can pop up when one variable is actually a superset of the other, which is what happened here. The likes received on a desktop device are part of the total likes received, so the two are highly related by nature. The variables are probably not independent, so the correlation cannot tell us what is driving the phenomenon, but it can help us decide which variables not to throw in together for an analysis.
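
The superset effect is easy to reproduce with made-up numbers. In this sketch mobile_likes is a hypothetical second component; the only point is that the total contains www_likes by construction.

```r
# Made-up counts: total likes are desktop (www) likes plus mobile likes.
www_likes    <- seq(0, 990, by = 10)
mobile_likes <- rep(c(0, 5, 10, 15, 20), times = 20)
likes        <- www_likes + mobile_likes

# Because likes contains www_likes, the correlation is close to 1.
cor(www_likes, likes)
```
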


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

#install.packages('alr3')
#library(alr3)

Create your plot!

#data("Mitchell")
#?Mitchell
#write.csv(Mitchell, "Mitchell.csv")

Mitchell <- read.csv("Mitchell.csv")

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
        geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

I am going to guess that the correlation will be approximately zero, because the correlation coefficient measures linear association. Temperature by month is cyclical, so the high temperatures should cancel out the low ones.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month, Mitchell$Temp)$estimate
##        cor 
## 0.05747063
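
A near-zero correlation does not mean the variables are unrelated. The sketch below uses a made-up stand-in for the Mitchell data (a pure annual sine cycle over 17 years, with no noise) rather than the real temperatures:

```r
# Made-up stand-in: 17 years of monthly temperatures on a pure annual cycle.
month <- 1:(12 * 17)
temp  <- sin(2 * pi * month / 12)

# The linear (Pearson) correlation with month is close to zero...
cor(month, temp)

# ...even though temp is fully determined by the month of the year:
# within each month-of-year group the spread is essentially zero.
spread_by_month <- tapply(temp, month %% 12, sd)
max(spread_by_month)
```
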

Making Sense of Data

Notes:

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
        geom_point() +
        scale_x_continuous(breaks = seq(0,12*17,12))


A New Perspective

What do you notice? Response: There is a cyclical pattern in the data (like a sine or cosine graph).

Watch the solution video and check out the Instructor Notes! Notes: I was right.


Understanding Noise: Age to Age Months

Notes:

pf$age_with_months <- (pf$age + ((12 - pf$dob_month)/12))
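
To check the formula on a few made-up users who share the same age in years but differ in birth month:

```r
# Made-up users: same age in years, different birth months.
age       <- c(36, 36, 36)
dob_month <- c(1, 6, 12)      # January, June, December

# Same formula as above: users born earlier in the year come out older.
age_with_months <- age + (12 - dob_month) / 12
age_with_months               # 36.91667 36.50000 36.00000
```
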

Age with Months Means

Programming Assignment

pf.fc_by_age_months <- pf %>%
        group_by(age_with_months) %>%
        summarize(friend_count_mean = mean(friend_count),
                  friend_count_median = median(friend_count),
                  n = n()) %>%
        arrange(age_with_months)

Noise in Conditional Means

ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
        geom_line()


Smoothing Conditional Means

Notes: We now have two plots: one with age in years and one with age in months. The resolution is different. With month-level bins we have less data to estimate each conditional mean, so the line is noisier.

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
        geom_line() +
        geom_smooth()

p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
        geom_line() +
        geom_smooth()

p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
        geom_line(stat = "summary", fun.y = mean)

library(gridExtra)
grid.arrange(p2, p1, p3, ncol = 1)
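
The expression round(age / 5) * 5 used for p3 snaps each age to the nearest multiple of five, which pools more users into each conditional mean. A short check on made-up ages:

```r
# Made-up ages, for illustration only.
age <- c(13, 17, 18, 22, 23, 27)

# Snap each age to the nearest multiple of 5.
age_bin <- round(age / 5) * 5
age_bin           # 15 15 20 20 25 25

# Each bin now pools several ages, so each mean uses more data.
table(age_bin)
```

Wider bins give a smoother line at the cost of hiding detail, which is the trade-off the three stacked plots show.
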


Which Plot to Choose?

Notes: Sometimes you don’t have to choose! We can explore the different versions. A newer version of a plot is not necessarily a better one. When we share work with a larger audience, one or two visualizations can be more powerful than a large portfolio of plots.


Analyzing Two Variables

Reflection: I learned how to deal with generating scatter plots in R. I learned how to jitter graphs and use alpha to get a better look at the density of data points. I learned that correlation is a useful tool, but does not imply causation, nor does it capture all the finer details of what might be happening in a plot. I learned how to zoom into a plot without clipping off data. I learned how to make line graphs, change colors, and plot confidence intervals. I learned how to make data sets of finer resolution, and generate summary data for different categories we might investigate.

